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Abstract 



We present a linear regression method for predictions on a small data 
set making use of a second possibly biased data set that may be much 
^ larger. Our method fits linear regressions to the two data sets while 

Pj penalizing the difference between predictions made by those two models. 

^H The resulting algorithm is a shrinkage method similar to those used in 

^H small area estimation. Our main result is a Stein-type finding for Gaussian 

responses: when the model has 5 or more coefficients and 10 or more error 
degrees of freedom, it becomes inadmissible to use only the small data set, 
CZ5 no matter how large the bias is. We also present both plug-in and AICc- 

based methods to tune the penalty parameter. Most of our results use an 
^ L2 penalty, but we also obtain formulas for Li penalized estimates when 

^ the model is specialized to the location setting. 

Keywords: Data fusion, Small area estimation. Stein estimation. Transfer learn- 
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^ 1 Introduction 

I The problem we consider here is how to combine linear regressions based on data 

• • from two sources. There is a small data set of expensive high quality observa- 

. 1^ tions and a possibly much larger data set with less costly observations. The big 

S^ data set is thought to have similar but not identical statistical characteristics 

5-H to the small one. The conditional expectation might be different there or the 

predictor variables might have been measured in somewhat different ways. The 
motivating application comes from within Google. The small data set is a panel 
of consumers, selected by a probability sample, who are paid to share their 
internet viewing data along with other data on television viewing. There is a 
second and potentially much larger panel, not selected by a probability sample 
who have opted in to the data collection process. 

The goal is to make predictions for the population from which the smaller 
sample was drawn. If the data are identically distributed in both samples, we 
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should simply pool them. If the big data set is completely different from the 
small one, then it makes sense to ignore it and fit only to the smaller data set. 

Many settings are intermediate between these extremes: the big data set is 
similar but not necessarily identical to the small one. We stand to benefit from 
using the big data set at the risk of introducing some bias. Our goal is to glean 
some information from the larger data set to increase accuracy for the smaller 
one. The difficulty is that our best information about how the two populations 
are similar is our samples from them. 

The motivating problem at Google has some differences from the problem 
we consider here. There were response variables observed in the small sample 
that were not observed in the large one and the goal was to study the joint 
distribution of those responses. That problem also had binary responses instead 
of the continuous ones considered here. This paper studies linear regression 
because it is more amenable to theoretical analysis and thus allows us to explain 
the results we saw. 

The linear regression method we use is a hybrid between simply pooling 
the two data sets and fitting separate models to them. As explained in more 
detail below, we apply shrinkage methods penalizing the difference between the 
regression coefficients for the two data sets. Both the specific penalties we use, 
and our tuning strategies, reflect our greater interest in the small data set. Our 
goal is to enrich the analysis of the smaller data set using possibly biased data 
from the larger one. 

Section [2] presents our notation and introduces Li and L2 penalties on the 
parameter difference. Most of our results are for the L2 penalty. For the L2 
penalty, the resulting estimate is a linear combination of the two within sample 
estimates. TheoremjTJgives a formula for the degrees of freedom of that estimate. 
Theorem [2] presents the mean squared error of the estimator and forms the basis 
for plug-in estimation of an oracle's value when an L2 penalty is used. 

Section I3] considers in detail the case where the regression simplifies to esti- 
mation of a population mean. In that setting, we can determine how plug-in, 
bootstrap and cross-validation estimates of tuning parameters behave. We get 
an expression for how much information the large sample can add. Theorem [3] 
gives a soft-thresholding expression for the estimate produced by Li penaliza- 
tion and Theorem H can be used to find the penalty parameter that an Li oracle 
would choose when the data are Gaussian. 

Section |4] presents some simulated examples. We simulate the location prob- 
lem and find that numerous L2 penalty methods are admissible, varying in how 
aggressively they use the larger sample. The Li oracle is outperformed by the 
L2 oracle in this setting. When the bias is small, the data enrichment methods 
improve upon the small sample, but when the bias is large then it is best to use 
the small sample only. Things change when we simulate the regression model. 
For dimension d ^ 5, data enrichment outperforms the small sample method 
in our simulations at all bias levels. We did not see such an inadmissibility 
outcome when we simulated cases with d ^ 4. 

Section |5] presents our main theoretical result, Theorem[5] When there are 5 
or more predictors and 10 or more degrees of freedom for error, then some of our 



data enrichment estimators make simply using the small sample inadmissible. 
The reduction in mean squared error is greatest when the bias is smallest, but 
no matter how large the bias is, we gain an improvement. This result is similar 



to Stein's classic result on estimation of a Gaussian mean (Stein 19561, but 
the critical threshold here is dimension 5, not dimension 3. The estimator we 
study employs a data-driven weighting of the two within-sample least squares 
estimators. We believe that our plug-in estimator is even better than this one. 
There are many ideas in different literatures on combining non-identically 
distributed data sets in order to share or borrow statistical strength. Of these, 
the closest to our work is small area estimation (Rao 2003) used in survey 



sampling. In chemometrics there is a similar problem called transfer calibration 



(Feudale et al. 2002). Medicine and epidemiology among other fields use meta- 



analysis (Borenstein et al. 2009). Data fusion (D'Orazio et al. 2006) is widely 



used in marketing. The problem has been studied for machine learning where 
it is called transfer learning. An older machine learning term for the underly- 
ing issue is concept drift. Bayesian statisticians use hierarchical models. Our 
methods are more similar to empirical Bayes methods, drawing heavily on ideas 
of Charles Stein. A Stein-like result also holds for multiple regression in the 
context of just one sample. The result is intermediate between our two sample 
regression setting and the one sample mean problem. In regression, shrinkage 
makes the usual MLE inadmissible when in dimension p ^ 4 (with the intercept 
counted as one dimension) and a sufficiently large n. See Copas (1983) for a 



discussion of shrinkage in regression and Stein ( 1960 ) who also obtained this 



result for regression, but under stronger assumptions. 

A more detailed discussion of these different but overlapping literatures is 
in Section ^ Some of our proofs are given in an (online) Appendix. 

There are also settings where one might want to use a small data set to enrich 
a large one. For example the small data set may have a better design matrix or 
smaller error variance. Such possibilities are artificial in the motivating context 
so we don't investigate them further here. 



2 Data enrichment regression 

Consider linear regression with a response y e M and predictors AT e M"^'. The 
model for the small data set is 

r, = A,/3 + e„ leS 

for a parameter /? e M'^ and independent errors Si with mean and variance 
(t|. Now suppose that the data in the big data set follow 

y, = A,(/3 + 7)+£„ ieB 

where 7 S K'' is a bias parameter and £i are independent with mean and 
variance a^. The sample sizes are n in the small sample and N in the big 
sample. 



There are several kinds of departures of interest. It could be, for instance, 
that the overall level of Y is different in S than in B but that the trends are 
similar. That is, perhaps only the intercept component of 7 is nonzero. More 
generally, the effects of some but not all of the components in X may differ in 
the two samples. One could apply hypothesis testing to each component of 7 
but that is unattractive as the number of scenarios to test for grows as 2"^. 

Let Xs e K"'''' and Xb G ^^""^ have rows made of vectors Xi ioi i e S 
and i ^ B respectively. Similarly, let I5 € M" and Yg € M.^ be corresponding 
vectors of response values. We use Vs — XjXs and Vb ~ XJ^Xb- 

2.1 Partial pooling via shrinkage and weighting 

Our primary approach is to pool the data but put a shrinkage penalty on 7. We 
estimate /3 and 7 by minimizing 

^(r, - X,/?)2 + J2{Y^ X,{I3 + 7))2 + AP(7) (1) 

ies ieB 

where A G [0, 00] and P{"f) ^ is a penalty function. There are several reason- 
able choices for the penalty function, including 

II7II2, WXslWl hill, and ||Xs7||i. 

For each of these penalties, setting A = leads to separate fits /3 and /? + 7 in 
the two data sets. Similarly, taking A = 00 constrains 7 = and amounts to 
pooling the samples. In many applications one will want to regularize /3 as well, 
but in this paper we only penalize 7. 

The Li penalties have an advantage in interpretation because they identify 
which parameters or which specific observations might be differentially affected. 
The quadratic penalties are simpler, so we focus most of this paper on them. 

Both quadratic penalties can be expressed as ||ArT7||| for a matrix Xt- 
The rows of Xt represent a hypothetical target population of Nt items for 
prediction. Or more generally, the matrix S = Ey = X^Xt is proportional to 
the matrix of mean squares and mean cross-products for predictors in the target 
population. 

If we want to remove the pooling effect from one of the coefficients, such 
as the intercept term, then the corresponding column of Xt should contain all 
zeros. We can also constrain 7^ — (by dropping its corresponding predictor) 
in order to enforce exact pooling on the j'th coefficient. 

A second, closely related approach is to fit f3s by minimizing X]ies(^* ^ 
XilS)"^, fit (Sb by minimizing X^iesl^* ^ XiP)'^, and then estimate (3 by 

^{uj)^iol3s + {l-uj)pB 

for some ^ u; ^ 1. In some special cases the estimates indexed by the weighting 
parameter lj E [n/{n + N), 1] are a relabeling of the penalty-based estimates 
indexed by the parameter A G [0, 00]. In other cases, the two families of estimates 



differ. The weighting approach ahows simpler tuning methods. Although we 
think that the penalization method may be superior, we can prove stronger 
results about the weighting approach. 

Given two values of A we consider the larger one to be more 'aggressive' in 
that it makes more use of the big sample bringing with it the risk of more bias 
in return for a variance reduction. Similarly, aggressive estimators correspond 
to small weights w on the small target sample. 

2.2 Special cases 

An important special case for our applications is the cell partition model. In 
the cell partition model, Xi is a vector containing C — 1 zeros and one 1. The 
model has C different cells in it. Cell c has N^ observations from the large data 
set and ric observations from the small data set. In an advertising context a cell 
may correspond to one specific demographic subset of consumers. The cells may 
be chosen exogenously to the given data sets. When the cells are constructed 
using the regression data then cross-validation or other methods should be used. 

A second special case, useful in theoretical investigations, has XjXg ex 
XqXb- This is the proportional design matrix case. 

The simplest case of all is the location model. It is the cell mean model 
with C = 1 cell, and it has proportional design matrices. We can get formulas 
for the optimal tuning parameter in the location model and it is also a good 
workbench for comparing estimates of tuning parameters. Furthermore, we are 
able to get some results for the ii case in the location model setting. 

2.3 Quadratic penalties and degrees of freedom 

The quadratic penalty takes the form P(7) = II-^ttIH = 7^Vt7 for a matrix 
Xt G ^'''"^ and Vt = XJ,Xt G M'^^''. The value r is d or n in the examples 
above and could take other values in different contexts. Our criterion becomes 

\\Ys ~ Xsl3f + \\Yb- Xb{P + 7)11' + A||XT7f- (2) 

Here and below ||a;|| means the Euclidean norm ||a;||2. 

Given the penalty matrix Xt and a value for A, the penalized sum of 
squares Q is minimized by j3x and 7a satisfying 



x'^x {"^] = x^y 



where 

A'-IXb Xb I e M("+^'+'-)x2d^ and ^ - I Fb I • (3) 




To avoid uninteresting complications we suppose that the matrix X^ X is 
invertible. The representation (|3| also underlies a convenient computational 



approach to fitting (3\ and 7a using r rows of pseudo-data just as one does in 
ridge regression. 

The estimate /3a can be written in terms of /3s — Vg^XjYs and Pb = 
Vg^ XJ^Yb as the next lemma shows. 

Lemma 1. Let Xs, Xb, and Xt in ([2| all have rank d. Then for any A ^ 0, 
the niinimizers f3 and 7 o/ ([2| satisfy 

/3 = Wx^s + {I~ Wx)Pb 
and 7 = {Vb + ^Vt)~^Vb{$b ~ /3) for a matrix 

Wx = {Vs + XVtVsWs + XVt)-HVs + WtV^Ws). (4) 

IfVT = Vs, then 

Wx = (Vb + XVs + XVb)-HVb + Ws). 
Proof. The normal equations of (pi) are 

{Vb + Vs)$ = VsPs + VbPb - VbI and (Fs + \Vt)i = Vb^b - VbP- 

Solving the second equation for 7, plugging the result into the first and solving 
for /3, yields the result with Wx = {Vs + Vb- Vb{Vb + Wry^VBy^Vs- This 
expression for Wx simplifies as given and simplifies further when Vt — Vs- □ 

The remaining challenge in model fitting is to choose a value of A. Because 
we are only interested in making predictions for the S data, not the B data, 
the ideal value of A is one that optimizes the prediction error on sample S. One 
reasonable approach is to use cross-validation by holding out a portion of sample 
S and predicting the held-out values from a model fit to the held-in ones as well 
as the entire B sample. One may apply either leave-one-out cross-validation or 
more general K-io\d cross-validation. In the latter case, sample S is split into K 
nearly equally sized parts and predictions based on sample B and K — 1 parts 
of sample S are used for the K^th held-out fold of sample S. 

In some of our applications we prefer to use criteria such as AIC, AICc, 
or BIG in order to avoid the cost and complexity of cross-validation. These 
alternatives are of most value when data enrichment is itself the inner loop of a 
more complicated algorithm. 

To compute AIC and alternatives, we need to measure the degrees of freedom 



used in fitting the model. We follow Ye (1998) and Efron (2004) in defining the 
degrees of freedom to be 

df(A) = ^^cov(i;,r,), (5) 



ies 



where Ys = Xs$x- Because of our focus on the S data, only the S data appear 
in the degrees of freedom formula. We will see later that the resulting AIC 
type estimates based on the degrees of freedom perform similarly to our focused 
cross-validation described above. 



Theorem 1. For data enriched regression the degrees of freedom given at ([5| 
satisfies df(A) = tr(WA) where W\ is given in LemmaYn IfVr — Vs, then 



' l + Xu 



df (A) = y ' ; ^'i (6) 

where vi, ... ,1^4 o-fc the eigen-values of Vg Vg Vg in which Vg is a sym- 
metric matrix square root of Vs ■ 



Proof. Please see Section 8.1 in the Appendix. D 



With a notion of degrees of freedom customized to the data enrichment 
context we can now define the corresponding criteria such as 

2^^(^)'> and 



AIC(A) = nlog(<7|(A)) + n(l + 
AICc(A) . nlog(^|(A)) + n(l + ^) / (l - '^^) , (7) 

where (tKA) = (n—d)^^ jyi£sO^^~ ■^^(^W)'^ ■ '^^'^ ^^^ ^^ more appropriate than 
BIC here since our goal is prediction accuracy, not model selection. We prefer 



the AICc criterion of Hurvich and Tsai ( 1989 ) because it is more conservative 



as the degrees of freedom become large compared to the sample size. 

Next we illustrate some special cases of the degrees of freedom formula in 
Theorem [1] First, suppose that A = 0, so that there is no penalization on 7. 
Then df (0) = tr(/) = d as is appropriate for regression on sample S only. 

We can easily see that the degrees of freedom are monotone decreasing in A. 
As A — )■ 00 the degrees of freedom drop to df(oo) = J2j=i ^j/(l + '^j)- This can 
be much smaller than d. For instance in the proportional design case, Vs = nil 
and Vb = NY; for a matrix E. Then all i/^ = n/N and so df(oo) = d/{l + N/n), 
which is quite small when n <^ N. 

For the cell partition model, d becomes C, S5 — diag(nc) and S^ = 
diag(Afc)- In this case df(oo) = J2c=i ^c/inc + Nc) which will usually be much 
smaller than df (0) — C . 

Monotonicity of the degrees of freedom makes it easy to search for the value 
A which delivers a desired degrees of freedom. We have found it useful to inves- 
tigate A over a numerical grid corresponding to degrees of freedom decreasing 
from d by an amount A (such as 0.25) to the smallest such value above df(oo). 
It is easy to adjoin A = 00 (sample pooling) to this list as well. 

2.4 Predictive mean square errors 

Here we develop an oracle's choice for A and a corresponding plug-in estimate. 
We work in the case where Vs = Vt and we assume that Vs has full rank. Given 
A, the predictive mean square error is E(||X5(/3 — /3)||^). 



1/2 

We will use a symmetric square root Vg of Vs as well as the matrix M = 
Vg Vg^Vg with eigendecomposition M = UDU~^ where the j'th column of 
U is Uj and D = diag(^'j). 

Theorem 2. The predictive mean square error of the data enrichment estimator 
is 

J— 1 J— 1 



where k'J = ujVg BVg Mj /or 6 = 77^ + cr^V; 



-1 



Proof. Please see Section 8.2 D 



The first term in ([8| is a variance term. It equals da^ when A = but for 
A > it is reduced due to the use of the big sample. The second term represents 
the error, both bias squared and variance, introduced by the big sample. 

2.5 A plug-in method 

A natural choice of A is to minimize the predictive mean square error, which 
must be estimated. We propose a plug-in method that replaces the unknown 
parameters (t| and Kj from Theorem by sample estimates. For estimates (t| 
and k3 we choose 






a|(l + Ai.,)2 + A2k2 



A = argmm> — — — -. (9) 

From the sample data we take (75 = WYs—Xs/isW^/in—d). A straightforward 
plug-in estimate of 9 is 

6piug =lj'^ + ^%Vb^, 

where ^ — Pb ~ Ps- Now we take k| = uJVj' QVg' u.j recalling that Uj and 
Vj derive from the eigendecomposition oi M = Vg Vg Vg . The resulting 
optimization yields an estimate Apiug. 

The estimate Opiug is biased upwards because £(77^) = 77^ -I- cr^Vg^ + 
o'gVg^. We have used a bias-adjusted plug-in estimate 

Obapi = alVs' + (77^ - <7lVs' - ^lVg')+ (10) 

where the positive part operation on a symmetric matrix preserves its eigenvec- 
tors but replaces any negative eigenvalues by 0. Similar results can be obtained 
with Gbapi = (77^ — '^s^s )j-- This latter estimator is somewhat simpler but 
the former has the advantage of being at least as large as o'gVg^ while the 
latter can degenerate to 0. 



3 The location model 

The simplest instance of our problem is the location model where Xs is a column 
of n ones and Xb is a column of N ones. Then the vector j3 is simply a scalar 
intercept that we call /i and the vector 7 is a scalar mean difference that we call 
5. The response values in the small data set are y^ = /_* + e^ while those in the 
big data set are Yi = {fi + 6) + Si. Every quadratic penalty defines the same 
family of estimators as we get using penalty \5^ . 

The quadratic criterion is X]j:es(^« ~ A*)^ + X]i(EB(^« ^ A* ^ <^)^ + ^^^- Taking 
Vs — n,VB = N and Vr = 1 in Lemma [l] yields 

, , - , nN + n\ 1 + A/TV 

jji — ioYs + [1 — io)YB with u) — — 



nN + nX + NX l + X/N + X/n' 
Choosing a value for uj corresponds to choosing 

nN{l-uj) 



A = 



Nio-n{l-Lu)' 



The degrees of freedom in this case reduce to df(A) = cu, which ranges from 
df(0) = 1 down to df (00) = n/{n + N). 

3.1 Oracle estimator of to 

The mean square error of fl{io) is 

MSE(^) = 0,2^ + (1 - uf (^ + 5^) . 

The mean square optimal value of uj (available to an oracle) is 

6^ + a%/N 
'^""''^ S'^ + al/N + al/n 

Pooling the data corresponds to Wpooi — n/(N+n) and makes ft equal the pooled 
mean Yp = (nYg + NYB)/{n + N). Ignoring the large data set corresponds to 
ws = 1. Here ojpooi ^ Worci ^ ^s- The oracle's choice of uj can be used to infer 
the oracle's choice of A. It is 



nN{l - Word) Na^ 



s 



(11) 



^l^orcl - n(l - Word) N S'^ + f^B ^ '^S 

The mean squared error reduction for the oracle is 

MSE(Word) _ ,^„. 

MSE(ws) " ''°'"^' ^^^> 

after some algebra. If (5 7^ 0, then as min(n, N) — > 00 we find Word ~> 1 and the 
optimal A corresponds to simply using the small sample and ignoring the large 



one. If we suppose that 6^0 and N —^ oo then the effective sample size for 



data enrichment may be defined using ( f 2 ) as 



^^5^ + al/N + al/n 



's 



c^orci S^ + al/N <52 



(13) 



The mean squared error from data enrichment with n observations in the smah 
sample, using the oracle's choice of A, matches that of n IID observations from 
the small sample. We effectively gain up to <7g/S'^ observations worth of infor- 
mation. This is an upper bound on the gain because we will have to estimate A. 



Equation ( 13 ) shows that the benefit from data enrichment is a small sample 
phenomenon. The effect is additive not multiplicative on the small sample size 
n. As a result, more valuable gains are expected in small samples. In some 
of the motivating examples we have found the most meaningful improvements 
from data enrichment on disaggregated data sets, such as specific groups of 
consumers. Some large data sets resemble the union of a great many small 
ones. 

3.2 Plug-in and other estimators of u 

A natural approach to choosing tu is to plug in sample estimates 

So^Yb-Ys, ^l^lT.^Y.-Ysf, and a| = ^^ ^(F, - ¥5)2. 

We then use Wpiug = {Sq + a'^/N)/{5Q + o^jN + cr'g/n) or alternatively Apiug = 
i7|/((T^ + ^'^o)- Oii^ bias-adjusted plug- in method reduces to 

fbapi , n ^B , fi2 ^S ^B 

Wbapi = -~ ^— ' where 6'bapi = ^ + [^ 

fc'bapi + cr^/n 



N K" n N 



The simpler alternative Wbapi = ii% ~ ^s / ''^) / %) + g^'Ve virtually identical values 
in our numerical results reported below. 

If we bootstrap the S and B samples independently M times and choose w 
to minimize 

-j2{Ys-ujYs"^*~{i~u)Yry, 

m—l 

then the minimizing value tends to Wpiug as M ^ 00. Thus bootstrap methods 
give an approach analogous to plug-in methods, when no simple plug-in formula 
exists. This is perhaps not surprising since the bootstrap is often described as 
an example of a plug-in principle. 

We can also determine the effects of cross-validation in the location setting, 
and arrive at an estimate of w that we can use without actually cross- validating. 
Consider splitting the small sample into K parts that are held out one by one 
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in turn. The K — 1 retained parts are used to estimate /i and then the squared 
error is judged on the held-out part. That is 



K 

2 



ujr^ = argmm jr^iYs^k - wis _fc - (1 - uj)Yi 



K 

fc=i 



where Ys^k is the average of F, over the fc'th part of S and Ys.-k is the average 
of Yi over a\\ K — \ parts excluding the fc'th. We suppose for simphcity that 
n = rK for an integer r. In that case Ys.-k = {nYs — rYs^k)/{n — r). Now 



^ Y.k{Ys.-k-YB){Ys^k-YB) 
T.k{Ys.-k-YB? 

After some algebra, the numerator of ( [T4| is 

K 

K{Ys - YbY - -^ Y^kYs.k - Ys? 
and the denominator is 

K{Ys - YbT + ( -^— ) y^{Ys,k " Ysf 



(14) 



n — r 

k=l 



2 K 

] 

n — r , 



k=l 

Letting 5o=Yb ~ % and o-| ^^ == (V^) Ef=i(^s,fc - Ysf, we have 

The only quantity in Wcv which depends on the specific K-w&y partition 
used is o^g X- ^^ ^^'^ groupings are chosen by sampling without replacement, 
then under this sampling, 

E(4^) = E((Y5a - Ys?) = ^(1 - l/K) 

using the finite population correction for simple random sampling, where s^ = 
a'gn/(n — 1). This simplifies to 

n 1 i^ - 1 ,oK-l 



ns^lK) = ^l—^-^^ ^ ^l- 



'n-lr K ■= n- 1 

Thus X-fold cross-validation chooses a weighting centered around 

.„.., ^g-0|/C-l) . (15, 

■ 4g + *!/[(>. -l)(7f- 1)1 

Cross-validation has the strange property that w < is possible. This can arise 
when the bias is small and then sampling alone makes the held-out part of the 
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small sample appear negatively correlated with the held-in part. The effect can 
appear with any K. We replace any uJqv.k < n/{n + N) by n/{n + N). 
Leave-one-out cross-validation has K = n (and r = 1) so that 

So + 4/"' 
Smaller K, such as choosing K = 10 versus n, tend to make uJcv,k smaller 
resulting in less weight on Yg. In the extreme with Sq ^ we find ll!cv,k ~ 
— {K — 1) so 10 fold CV is then very different from leave-one-out CV. 

Remark 1. The cross-validation estimates do not make use of o-^ because 
the large sample is held fixed. They are in this sense conditional on the large 
sample. Our oracle takes account of the randomness in set S, so it is not 
conditional. One can define a conditional oracle without difficulty, but we omit 
the details. Neither the bootstrap nor the plug-in methods are conditional, as 
they approximate our oracle. Comparing cross-validation to the oracle we expect 
this to be reasonable if (J%/N <C min((5^,cr^/n). Taking ujhapi as a representor 
of unconditional methods and ojcv.n as a representor of conditional ones, we 
see that the latter has a larger denominator while they both have the same 
numerator, at least when 6q > o'g/n. This suggests that conditional methods 
are more aggressive and we will see this in the simulation results. 

3.3 Li penalty 

For the location model, it is convenient to write the Li penalized criterion as 

J2iy^-^^r + J2(y^-^^-sr + 2X\s\. (16) 

The minimizers /t and 6 satisfy 

„ nYs + N{Yb - S) 

M= ^^ 1 and ,,„, 

' n + N (17) 

5^eiYB~fi;\/N) 

for the well-known soft thresholding operator 0(z;t) = sign(z)(|z| — t)^. 

The estimate /t ranges from Yg at A = to the pooled mean Yp at A = cxd. 
In fact jj, reaches Yp at a finite value A = A* = nN\YB — Y5|/(A^ + n) and both 
/t and S are linear in A on the interval [0, A.^]: 



Theorem 3. //O ^ A ^ nNlYp -~Ys\/(n + N) then the minimizers of (16) are 



fl^Ys + -aign{YB -Ys), and 

(18) 

- - N + n - - 

S = Yb-Ys- X—— sign{YB - Ys). 
i\n 

If \> nN\YB - Ys\/in + N) then they are S ^ and fi = Yp. 
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Proof. If A > nN\YB — Ys\/{n + N) then we may find directly that with any 
value of (5 > and corresponding /i given by (17), the derivative of (16) with 



respect to S is positive. Therefore S ^ and a similar argument gives S ^ 0, so 
that ^ = and then /x = {nYs + NYB)/{n + N). 

Now suppose that A ^ A*. We verify that the quantities in (18) jointly 



satisfy equations (17). Substituting 5 from (18) into the first line of (17) yields 

N{Ys + \{N + n)r^l{Nn)) 



nY^ 



N 



Y. 



A 



s + - sign (Yb 
n 



Y^ 



matching the value in ( 18 ). Conversely, substituting /i from ( 18 ) into the second 



line of (17) yields 



A;^)=e(F, 



% - ^sign(YB -Ys)\\ 
n N 



(19) 



Because of the upper bound on A, the result is Yb — I5-- A(l/n + l/iV)sign(YB - 



Ys) which matches the value in (18) 



D 



With an Li penalty on S we find from Theorem [3] that 

/i = Ys + min(A, A*)sign(YB - Ys)/n. 

That is, the estimator moves Ys towards Yb by an amount X/n except that 
it will not move past the pooled average Yp. The optimal choice of A is not 
available in closed form. 

3.4 An Li oracle 

Under a Gaussian data assumption, it is possible to derive a formula for the 
mean squared error of the Li penalized data enrichment estimator at any value 
of A. While it is unwieldy, the Li mean square error formula is computable and 
we can optimize it numerically to compute an oracle formula. As with the L2 
setting we must plug in estimates of some unknowns first before optimizing. This 
allows us to compare Li to L2 penalization in the location setting simulations 
of Section ID 

To obtain a solution we make a few changes of notation just for this subsec- 
tion. We replace \/n by A and define a = N/{N + n) and use Sq = Yb — Yg . 
Then 

A(A) - (is + A • sign(,5o))/(|<5o|a ^ A) + (aYs + (1 - a)Ys)/(|<5o|a < A) 

= (oYb + (1 - a)Ys) - (ado - A • sign(,5o))/(|'5o|a ^ A). (20) 

Without loss of generality we may center and scale the Gaussian distributions 
so that Ys ~ A/'(0, 1) and Yb ^ J\f{S,a^). The next Theorem defines the 
distributions of Y^ for i G S and i G B to obtain that scaling. We also introduce 
constants b = a^/{l+a'^), S = S/^/l + cr^, x — (A/a)/-\/l -I- cr^, and the function 
g{x) = ^{x)~x(p{x) where (p and <& are the Af{0, 1) probability density function 
and cumulative distribution function, respectively. 
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Theorem 4. Suppose that Y^ ^ A/'(0, n) for i £ S independently of Yi ~ 
J\f{S,a'^N) for i G B. Let ft be the Li estimate from (20), using parameter 



A ^ 0. Then the predictive mean squared error is 

E(A(A)2) ^ a^6^ + {a + b- 1)2(1 + a"^) + b 

- a{a + 26 - 2)(1 + a^)[l ~ g{i ~ 6) + g{-S: ~ 6)] 

- [2aA + 2{a + b- l){a5 - X)]y/l+~^fix - S) (21) 

- [2aA -2{a + b- l){a5 + A)] VTT^(p(-x - S) 

- {a6 - X){ad + A)[l - $(£ - S) + $(-x - S)]. 

Proof. Please see Section [8?3| in the Appendix. D 

3.5 Cell means 

The ceU mean setting is simply C copies of the location problem. One could 
estimate separate values of A in each of them. Here we remark briefly on the 
consequences of using a common A or oj over all cells. 

We do not simulate the various choices. We look instead at what assumptions 
would make them match the oracle formula. In applications we can choose the 
method whose matching assumptions are more plausible. 

In the -L2 setting, one could choose a common A using either the penalty 
A ^c=i "C^c or A J2c=i ^c • Call these cases ^2,™ and ^2,1 respectively. Dropping 
the subscript c we find 

1 + Xn/N , 1 + A/TV 

^L2 „ = 1—. — ,,, , , , and UJL2 1 = 



1 + Xn/N + y "^"'^ 1 + A/A^ + A/n 

compared to Word — (nS'^ + (Tgn/N)/{nS^ + a^n/N + cr|). 

We can find conditions under which a single value of A recovers the oracle's 
weighting. For UJL2.1 these are ag^~ (j'g_^ in all cells as well as A = cr'^s c/^l 
constant in c. For Wia n these are cr^ ^ = cr|^ and A = o'|^/(nc<5^) constant 
in c. The £2,1 criterion looks more reasonable here because we have no reason 
to expect the relative bias 5c/<Js,c to be inversely proportional to ^Jric. 

For a common lo to match the oracle, we need cr^ cl-^c = f 5 cl''^c to hold in 
all cells as well as a ct| c/("-c^c) to be constant in c. The first clause seems quite 
unreasonable and so we prefer common-A approaches to common weights. 

For a common Li penalty, we cannot get good expressions for the weight 
variable w. But we can see how the Li approach shifts the mean. An Li^i 
approach moves /tc from Ys^c towards Yb,c by the amount X/uc in cell c, but 
not going past the pooled mean Yp^ = {nYs^c + NYb,c)/{N + n) for that cell. 
The other approaches use different shifts. An Li^ approach moves fi^ from Ygc 
towards Yb.c by the amount A in cell c (but not past Yp^c)- It does not seem 
reasonable to move /ic by the same distance in all cells, or to move them by an 
amount proportional to l/ric and stopping at Yp^ doesn't fix this. We could 
use a common moving distance proportional to 1/ ^/ric (which is the order of 



statistical uncertainty in Ys,c) by using the penalty X]c=i v^l^cl- 
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4 Numerical examples 

We have simulated some special cases of the data enrichment problem. First 
we simulate the pure location problem which has d — I. Then we consider the 
regression problem with varying d. 

4.1 Location 

We simulated Gaussian data for the location problem. The large sample had 
N = 1000 observations and the small sample had n = 100 observations: Xi ~ 
Af{fi, (t|) for i e S* and Xi ~ A/'(/i + S, ag) for i £ B. Our data had fi — and 
ag — ag — I. We define the relative bias as 



We investigated a range of relative bias values. It is only a small simplification 
to take o'g = a^. Doubling a^ has a very similar effect to halving N. Equal 
variances might have given a slight relative advantage to the hypothesis testing 
method as described below. 

The accuracy of our estimates is judged by the relative mean squared error 
E((/i — /i)^)/(cr|/ri). Simply taking fi = Ys attains a relative mean squared 
error of 1. 

Figure [T] plots relative mean squared error versus relative bias for a collection 
of estimators, with the results averaged over 10,000 simulated data sets. We used 
the small sample only method as a control variate. 

The solid curve in Figure [T] shows the oracle's value. It lies strictly below 
the horizontal S'-only line. None of the competing curves lie strictly below that 



line. None can because Yg is an admissible estimator for d = 1 (Stein 1956). 
The second lowest curve in Figure [T] is for the oracle using the Li version of 
the penalty. The Li penalized oracle is not as effective as the L2 oracle and 
it is also more difficult to approximate. The highest observed predictive MSEs 
come from a method of simply pooling the two samples. That method is very 
successful when the relative bias is near zero but has an MSE that becomes 
unbounded as the relative bias increases. 

Now we discuss methods that use the data to decide whether to use the 
small sample only, pool the samples or choose an amount of shrinkage. We may 
list them in order of their worst case performance. From top (worst) to bottom 
(best) in Figure fl] they are: hypothesis testing, 5-fold cross-validation, 10-fold 
cross-validation, AICc, leave-one-out cross-validation, and then the simple plug- 
in method which is minimax among this set of choices. AICc and leave-one-out 
are very close. Our cross-validation estimators used uj ~ max(wcv,i<', n/{'n-'r N)) 



where Wcv.k is given by ( 15 1 



The hypothesis testing method is based on a two-sample i-test of whether 
5 = Q. If the test is rejected at a = 0.05, then only the small sample data is 
used. If the test is not rejected, then the two samples are pooled. That test was 
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Relative bias 

Figure 1: Numerical results for the location problem. The horizontal line at 1 
represents using the small sample only and ignoring the large one. The lowest 
line shown is for an oracle choosing A in the L2 penalization. The green curve 
shows an oracle using the Li penalization. The other curves are as described in 
the text. 



based on a^ — dg which may give hypothesis testing a slight advantage in this 
setting (but it still performed poorly). 

The AICc method performs virtually identically to leave-one-out cross-validation 
over the whole range of relative biases. 

None of these methods makes any other one inadmissible: each pair of curves 
crosses. The methods that do best at large relative biases tend to do worst 
at relative bias near and vice versa. The exception is hypothesis testing. 
Compared to the others it does not benefit fully from low relative bias but it 
recovers the quickest as the bias increases. Of these methods hypothesis testing 
is best at the highest relative bias, K-io\d cross-validation with small K is best 
at the lowest relative bias, and the plug-in method is best in between. 

Aggressive methods will do better at low bias but worse at high bias. What 
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we see in this simulation is that i^-fold cross-vahdation is the most aggressive 
followed by leave-one-out and AICc and that plug-in is least aggressive. These 
findings confirm what we saw in the formulas from Section [3] Hypothesis testing 
does not quite fit into this spectrum: its worst case performance is much worse 
than the most aggressive methods yet it fails to fully benefit from pooling when 
the bias is smallest. Unlike aggressive methods it does very well at high bias. 

4.2 Regression 

We simulated our data enrichment method for the following scenario. The small 
sample had n — 1000 observations and the large sample had N — 10,000. The 
true /3 was taken to be 0. This has no loss of generality because we are not 
shrinking /3 towards 0. The value of 7 was taken uniformly on the unit sphere 
in d dimensions and then multiplied by a scale factor that we varied. 

We considered d = 2,4,5 and 10. All of our examples included an intercept 
column of Is in both Xs and Xb- The other d—1 predictors were sampled from a 
Gaussian distribution with covariance Cs or Cb , respectively. In one simulation 
we took Cs and Cb to be independent Wishart(/, d— 1, d— 1) random matrices. 
In the other they were sampled as Cs = Id-i + puvJ and Cb = Id-i + pvv^ 
where u and v are independently and uniformly sampled from the unit sphere 
in W^^^ and p ^ is a parameter that measures the lack of proportionality 
between covariances. We chose p — d so that the sample specific portion of the 
variance has comparable magnitude to the common part. 

We scaled the results so that regression using sample S only yields a mean 
squared error of 1 at all values of the relative bias. We computed the risk of an 
L2 oracle, as well as sampling errors when A is estimated by the plug-in formula, 
by our bias-adjusted plug-in formula and via AICc. In addition we considered 
the simple weighted combination u^s + (1 — a;)/3s with uj chosen by the plug-in 
formula. 

Figure [2] shows the results. For d — 2 and also d = 4 none of our methods 
universally outperforms simply using the S sample. For d — 5 and d — 10, all 
of our estimators have lower mean squared error than using the S sample alone, 
though the difference becomes small at large relative bias. 

We find in this setting that our bias-adjusted plug-in estimator closely matches 
the AICc estimate. The relative performance of the other methods varies with 
the problem. Plain plug-in always seemed worse than AICc and adjusted plug- 
in at low relative bias and better than these at high biases. Plug-in's gains 
at high biases appear to be less substantial than its losses at low biases. Of 
the other methods, simple scalar weighting is worst for the high dimensional 
Wishart case without being better in the other cases. The best overall choices 
are bias-adjusted plug-in and AICc. 
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Figure 2: This figure shows relative predicted MSE versus relative bias for two 
simulated regression problems described in the text. 



5 Proportional design and inadmissibility 

The proportional design case has Vb oc Vs and Vr oc ^5. Suppose that Vg = 
NY,, Vs — nT, and Vr — "^ for a positive definite matrix S. Our data enrichment 
estimator simplifies greatly in this case. The weighting matrix W\ in Lemma [T] 
simplifies to W\ — ujI where to = {N + nX)/{N + nX + NX). As a result 
/3 — uj(3s + (1 ~ ^)I3b and we can find and estimate an oracle's value for uj. If 
different constants of proportionality, say M and m are used, then the effect is 
largely to reparameterize A giving the same family of estimates under different 
labels. There is one difference though. The interval of possible values for a; is 
[n/N, 1] in our case versus [m/M, 1] for the different constants. To attain the 
same sets of uj values could require use of negative A. 

The resulting estimator of /3 with estimated w dominates Ps (making it 
inadmissible) under mild conditions. These conditions given below even allow 
violations of the proportionality condition Vb oc Vs but they still require Vt oc 
Vs- Among these conditions we will need the model degrees of freedom to be 
at least 5, and it will suffice to have the error degrees of freedom in the small 
sample regression be at least 10. The result also requires a Gaussian assumption 
in order to use a lemma of Stein's. 

We write Ys = XsjS + es and Yb = Xb(/3 + 7) + eg for es '^ Af{0, ct|) 

and £_B ~ A/'(0, cr^). The data enrichment estimators are /3(A) and 7(A). The 
parameter of most interest is /3. If we were to use only the small sample we 
would get /3s = iXjXs)~'XjYs = /3(0). 



18 



In the proportional design setting, the mean squared prediction error is 

/H = e(||Xt(/3H-/3)|P) 

= tr((c.24l]^i + (1 - c.)2(77T + a%E^'))E). 
This error is minimized by the oracle's parameter value 

"^' tr((77T + a|E^i)E) + a|tr(E^iE)- 

With Ss = nS and Sb = iVE, we find 

_ -f'^Y.-f + dal/N 

''°''' ~ ^-^i:^ + dal/N + dal/n 

The plug-in estimator is 

'^piug - ^T^^ + da|/iV + dal/n ^ ^' 

where d| = \\Ys - Xs^s\\V{n - d) and a| = \\Yb - XEPsWyiN - d). We 
will have reason to generalize this plug- in estimator. Let h(a^) be any nonneg- 
ative measurable function of a^ with E{h(a%)) < 00. The generalized plug-in 
estimator is 

_ ^E^ + hjal) 

^Pius,h - ^T^^ + h{al) + dal/n ' ^ ' 

Here are the conditions under which (3s is made inadmissible by the data 
enrichment estimator. 

Theorem 5. Let Xs G M"'''' and Xb eR^'''' be fixed matrices with XjXs = 
nE and X^Xg =^ -/VE^ where E and E^ both have rank d. Let Yg ~ Af{Xgf3, dgln) 
independently of Yb ^ N{Xb{I3 + 7),o'^/Ar). If d ^ 5 and m = n — d ^ 10, 
then 

E(||Xt/3(^) - Xt(3\\^) < E(\\Xt^s - ^t/3|P) (24) 

holds for any nonrandom matrix Xt with X^Xt = E and any Cj = (Dpiug.h given 



by (23) 



Proof. Please see Section 8.5 in the Appendix. D 



The condition on m can be relaxed at the expense of a more complicated 
statement. From the details in the proof, it suffices to have d ^ 5 and m(l — 
4/d) ^ 2. 

The result in Theorem [5] is similar to the Stein estimator result. There, the 
sample mean of a Gaussian population is an inadmissible estimator in d = 3 
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dimensions or higher but is admissible in 1 or 2 dimensions. Here there are two 
samples to pool and the change takes place a,t d — 5. 

Because E(7^S7) = 7^S7 + dog/n + da^/N it is biased high and so there- 
fore is Wpiug, making it a little conservative. We can make a bias adjustment, 



replacing 7^X7 by 7^27 — da'g/n — d&g/N. The result is 



Wbapi 



J^'S'^ — d&g/n 



N' 



(25) 



where values below n/(n + N) get rounded up. This bias-adjusted estimate of w 
is not covered by Theorem^ Subtracting only aj^/N instead of ct'b/N -\- cr'g/n 



is covered, yielding 



l^^l 



bapi -5,TS^ + rf^2/„' 



(26) 



which corresponds to taking h{ag) = in equation (23). 



6 Related literatures 

There are many disjoint literatures that study problems like the one we have 
presented. They do not seem to have been compared before and the literatures 
seem to be mostly unaware of each other. We give a summary of them here, 
kept brief because of space limitations. 

The key ingredient in this problem is that we care more about the small 
sample than the large one. Were that not the case, we could simply pool all the 
data and fit a model with indicator variables picking out one or indeed many 
different small areas. Without some kind of regularization, that approach ends 
up being similar to taking A = and hence does not borrow strength. 

The closest match to our problem setting comes from small area estimation in 



survey sampling. The monograph by Rao ( 2003 1 is a comprehensive treatment 



of that work and Ghosh and Rao ( 1994 ) provide a compact summary. In that 



context the large sample may be census data from the entire country and the 
small sample (called the small area) may be a single county or a demographically 
defined subset. Every county or demographic group may be taken to be the 



small sample in its turn. The composite estimator (Rao 2003 Chapter 4.3) is a 



weighted sum of estimators from small and large samples. The estimates being 
combined may be more complicated than regressions, involving for example 
ratio estimates. The emphasis is usually on scalar quantities such as small 
area means or totals, instead of the regression coefficients we consider. One 



particularly useful model (Ghosh and Rao 1994 Equation (4.2)) allows the 
small areas to share regression coefficients apart from an area specific intercept. 
Then BLUP estimation methods lead to shrinkage estimators similar to ours. 



The methods of Copas ( 1983 ) can be applied to our problem and will result 



in another combination that makes /?§ inadmissible. That combination requires 
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only four dimensional regressions instead of the five used in Theorem [5] for 
pooling weights. That combination yields less aggressive predictions. 

In chemometrics a calibration transfer problem ( Feudale et al. 2002[ ) comes 
up when one wants to adjust a model to new spectral hardware. There may be a 
regression model linking near-infrared spectroscopy data to a property of some 
sample material. The transfer problem comes up for data from a new machine. 
Sometimes one can simply run a selection of samples through both machines 
but in other cases that is not possible, perhaps because one machine is remote 
(Woody et al. 2004). Their primary and secondary instruments correspond to 



our small and big samples respectively. Their emphasis is on transfering either 
principal components regression or partial least squares models, not the plain 
regressions we consider here. 

A common problem in marketing is data fusion, also known as statistical 
matching. Variables (X, Y) are measured in one sample while variables {X, Z) 
are measured in another. There may or may not be a third sample with some 
measured triples {X,Y, Z). The goal in data fusion is to use all of the data to 
form a large synthetic data set of {X, Y, Z) values, perhaps by imputing missing 
Z for the {X, Y) sample and/or missing Y for the {X, Z) sample. When there is 
no {X, Y, Z) sample some untestable assumptions must be made about the joint 
distribution, because it cannot be recovered from its bivariate margins. The 



text by D'Orazio et al. (2006) gives a comprehensive summary of what can and 



cannot be done. Many of the approaches are based on methods for handling 



missing data (Little and Rubin 2009). 



Our problem is an instance of what machine learning researchers call do- 
main adaptation. They may have fit a model to a large data set (the 'source') 
and then wish to adapt that model to a smaller specialized data set (the 'tar- 
get'). This is especially common in natural language processing. NIPS 2011 
included a special session on domain adaptation. In their motivating problems 
there are typically a very large number of features (e.g., one per unique word 
appearing in a set of documents) . They also pay special attention to problems 
where many of the data points do not have a measured response. Quite often 
a computer can gather high dimensional X while a human rater is necessary 



to produce Y. Daume (2009) surveys various wrapper strategies, such as fit- 



ting a model to weighted combinations of the data sets, deriving features from 
the reference data set to use in the target one and so on. [Cortes and MohrT| 
(2011) consider domain adaptation for kernel-based regularization algorithms, 
including kernel ridge regression, support vector machines (SVMs), or support 
vector regression (SVR). They prove pointwise loss guarantees depending on 
the discrepancy distance between the empirical source and target distributions, 
and demonstrate the power of the approach on a number of experiments using 
kernel ridge regression. 



A related term in machine learning is concept drift (Widmer and Kubat 



19961. There a prediction method may become out of date as time goes on. 



The term drift suggests that slow continual changes are anticipated, but they 
also consider that there may be hidden contexts (latent variables in statistical 
teminology) affecting some of the data. 
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7 Conclusions 

We have studied a middle ground between pooling a large data set into a smaller 
target one and ignoring it completely. In dimension d ^ 5 only a small number of 
error degrees of freedom suffice to make ignoring the large data set inadmissible. 
When there is no bias, pooling the data sets may be optimal. Theorem [5] does 
not say that pooling is inadmissible. When there is no bias, pooling the data 
sets may be optimal. We prefer our hybrid because the risk from pooling grows 
without bound as the bias increases. 



Acknowledgments 

We thank the following people for helpful discussions: Penny Chu, Corinna 
Cortes, Tony Fagan, Yijia Feng, Jerome Friedman, Jim Koehler, Diane Lambert, 
Elissa Lee and Nicolas Remy. 



References 

Borenstein, M., Hedges, L. V., Higgins, J. P. T., and Rothstein, H. R. (2009). 
Introduction to M eta- Analysis. Wiley, Chichester, UK. 

Copas, J. B. (1983). Regression, prediction and shrinkage. Journal of the Royal 
Statistical Society, Series B, 45(3):311-354. 

Cortes, C. and Mohri, M. (2011). Domain adaptation in regression. In Proceed- 
ings of The 22nd International Conference on Algorithmic Learning Theory 
(ALT 2011), pages 308-323, Heidelberg, Germany. Springer. 

Daume, H. (2009). Frustratingly easy domain adaptation. (arXiv:0907.1815). 

D'Orazio, M., Di Zio, M., and Scanu, M. (2006). Statistical Matching: Theory 
and Practice. Wiley, Chichester, UK. 

Efron, B. (2004). The estimation of prediction error. Journal of the American 
Statistical Association, 99(467) :619-632. 

Feudale, R. N., Woody, N. A., Tan, H., Myles, A. J., Brown, S. D., and Ferre, J. 
(2002) . Transfer of multivariate calibration models: a review. Chemometrics 
and Intelligent Laboratory Systems, 64:181-192. 

Ghosh, M. and Rao, J. N. K. (1994). Small area estimation: an appraisal. 
Statistical Science, 9(l):55-76. 

Hurvich, C. and Tsai, C. (1989). Regression and time series model selection in 
small samples. Biometrika, 76(2):297-307. 

Little, R. J. A. and Rubin, D. B. (2009). Statistical Analysis with Missing Data. 
John Wiley & Sons Inc., Hoboken, NJ, 2nd edition. 



22 



Rao, J. N. K. (2003). Small Area Estimation. Wiley, Hoboken, NJ. 

Stein, C. M. (1956). Inadmissibility of the usual estimator for the mean of a mul- 
tivariate normal distribution. In Proceedings of the Third Berkeley symposium 
on mathematical statistics and probability, volume 1, pages 197-206. 

Stein, C. M. (1960). Multiple regression. In Olkin, I., Ghurye, S. G., Hoeffding, 
W., Madow, W. G., and Mann, H. B., editors, Contributions to probability 
and statistics: essays in honor of Harald Hotelling. Stanford University Press, 
Stanford, CA. 

Stein, C. M. (1981). Estimation of the mean of a multivariate normal distribu- 
tion. The Annals of Statistics, 9(6):1135-1151. 

Widmer, G. and Kubat, M. (1996). Learning in the presence of concept drift 
and hidden contexts. Machine Learning, 23:69-101. 

Woody, N. A., Feudale, R. N., Myles, A. J., and Brown, S. D. (2004). Transfer 
of multivariate calibrations between four near-infrared spectrometers using 
orthogonal signal correction. Analytical Chemistry, 76(9):2596-2600. 

Ye, J. (1998). On measuring and correcting the effects of data mining and model 
selection. Journal of the American Statistical Association, 93:120-131. 



8 Appendix: proofs 

This appendix presents proofs of the results in this article. They are grouped 
into sections by topic, with some technical supporting lemmas separated into 
their own sections. 

8.1 Proof of Theorem [l] 

First df (A) = cr-2tr(cov(A:s/3,ys)) = aghT:{XsWx{XlXs)-^Xlal) = tT{Wx). 
Next with Xt = Xs, and M = Vy^V^^Vy^ , 

tr(iyA) = tr(ys + WsVb^Vs + >ysY\ys + AV^sV^s '"^s)- 



1 j'2 1/2 

We place Vg Vg between these factors and absorb them left and right. Then 
we reverse the order of the factors and repeat the process, yielding 

tr(M^A) = tr(/ + \M + Xiy^{I + \M). 

Writing M = C/diag(^i, . . . , Vd)U~^ for an orthogonal matrix U and simplifying 
yields the result. D 
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8.2 Proof of Theorem [2] 

Proof. First E{\\Xt^ - Xt(3\\^) = tr(VsE((/3 - /3)(/3 - /3)T)). Next using W = 
W\, we make a bias- variance decomposition, 

E((/3 - ;9)(/3 - /3)T) = (/ _ W^)77T(j _ yy^x ^ cov(W^/3s) + cov((/ - W)Pb) 

^ (jIwVs'^w^ + {i - w)e{i - wY, 

for 8 = 77^ + ffly^ 1. Therefore E(||Xs(/3 - /3)||2) = alir{VsWVs^W^) + 
tr(e(/-l^)"ri/s(/- VF)). 

— ~^ 1/2 1/2 

Now we introduce W = Vg WVg finding 

W = Vg^\VB + \Vs + XVBy\VB + XVs)Vs^^^ 
= {I + \M + XI)-\I + AM) 
= UDU~^, 

where D = diag((l + Xi'j)/{1 + A + Xi^j))- This aUows us to write the first term 
of the mean squared error as 

' (1 + Az.,)2 



altiiVsWVs^W~^) = crltTiWW'^) = cr|Xl TT 



^(1 + A + A«,)2 



For the second term, let 6 = Vg^^QVg^'^. Then 



tr(e(/ - WfVs{I - W)) = tr(e(/ - W)'^{I - W)) 

= ix{QU{I-DfU^) 

d „Tt/1/2c>t^1/2 



D 



8.3 Proof of Theorem H 

We will use this small lemma. 

Lemma 2. //X ~ 7V(0, 1), then ¥.{XI{X < x)) = -(^(x), E(X2/(X < x)) = 
g{x) and 

E{X^I{\X + b\^x))^l- g{x -b)+ g{~x - b) 

where g{x) = $(a;) — xip{x). 

Proof. First E{XI{X s^ x)) = /^^ z(p{z) dz = - /^_^ (^'(z) dz = -~f{x). Next, 

z <f{z) dz = — zip' (z) dz = — / 'y5(z) dz — z(/3(z) = 9{x). 
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Then 

E{X^I{\X + b\^ x)) = E{X^I{X + b^x))+ E{X^I{X + fe sC -x)) 
= E(X2(1 - /(X + 6 s^ a;)) + g{~x - b)) 
= E{X^) - E{X^I{X + 6 s^ x)) + g{-x - b) 
= l~g{x~b)+g{-x-h). U 

Now we prove Theorem |4| We let t — 6q — 6 and j^ — Yb + cy^Ys — 5 ■ Then 

cov(e,r?) = 0, e ~ 7V(0, 1 + cr^), ?/ - 7V(0, cr^ + cr''), and Yg - ^ ~ ^ 



1 + a^ 



Recall that we defined b = a / {I + a ) , and so 

YB=S + r]- <T^^^^^ =6 + be+il- b)r]. 
1 + cr"^ 

Also with a = N/{N + n), 

7?-e 



qYb + (1 - a)y5 = a(5 + a(6e + (1 - 6)77) + (1 - a)- „ 

1 + (T^ 

= a(5 + (a6 - (1 - a)(l - 6))e + (a(l - 5) + (1 - a)(l - 5))?? 
= aS + (a + b - l)e + {1 - b)f]. 

Letting S = e + 6, we have 

fi^a5+{a + b-l)e+{l-b)Tj-{aS-X- sign(S'))/(|S'| ^ a^^A) 

from which the MSE can be calculated: 

E(/i2(A)) = E{{a6 + (a + 6 - l)e + (1 - b)rjf) 

- 2E{{aS + (a + 6 - l)e + (1 - b)ri){aS - A • sign(S'))/(|S'| ^ a-^X)) 
+ E{{aS - A • sign(5))2/(|5| ^ fl-^A)) 
^[l]-2x[2] + [3]. 



First 



[1] - a2^2 + (a + 5 _ 1)2(1 + ^2) ^ (^ _ ^)2^2(^ ^ ^2) 

= a^^^ + {a + b- 1)2(1 + cr^) + b. 
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Next using <J>(a;) = 1 — $(2;), 

[2] ^ E{[a5 + {a + b - l)e] [a{S) - A • sign(S')]/(|S'| ^ a-^X)) 

= E{{a6{a6 - A • sign(S')) + e [a6a +{a + b- l){a5 - A • sign(S'))] +a{a + b~ l)e^}I{\S\ ^ a^^A)) 
= E(a(5(a(5 - A • sign(S'))/(|S'| ^ fl-^A)) 

+ E([a^(5 + (a + 6 - l)(a^ - A • sign(5))] eI{\S\ > fl-^A)) 
+ E(a(a + 6 - l)e^/(|5| > fl-^A)) 

i"^A-J\ ., . ,. /-a-iA-J' 



a5(a(5 - A)$ 



Vi 



+ aS{aS + \)^ 



VT 



+ [a^S + {a + b~l){a6~ A)]E(e/(S' ^ a^^A)) 
+ [a^S + {a + b ~ l){aS + A)]E(e/(S' < -a^^A)) 
+ a{a + b- l)E(e2/(|5| ^ a-^A)). 



Recall that we defined x = a ^A/\/l + ct^ and (5 = <5/\/l + cr^. Now using 
Lemma [2] 



E(e2/(|5| ^fl-^A)) = (1 + ct2)e ^2/ 



IX 



> 



v/(iT^' 7(1+^ 



= (l + a2)[l-.g(.T-^)+g(- 



^)]. 



Next 



So, 



E(e/(|5| ^ a"U)) = E{eI{S ^ a^^A)) + E(e/(S' :$ -a^^A)) 
= -E(e/(S' s^ a-^X)) + E{eI{S s^ -a^^A)) 
= vl + CT-^(/3(a; — (5) — V 1 + (t2(^(_;J _ ^), 



[2] = a(5(a(5 - A)'l(x -6) + aS{aS + A)$(-x - S) 
+ [a^S +ia + b-l){aS- A)] \/l+~a^ip{x - S) 
- [a^S + (a + b - l){a5 + X)]^/lVa^(p{-x - S) 
+ a{a + b- 1)(1 + o-2)[i _ g(j _ J) + 5(_;j _ ^)]. 
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Finally, 

[3] = E{[aiS) - A • sign(5)]2/(|5| ^ a-'X)) 

= E{[a^e^ + 2a{ad - A • sign(5))e + {a6 - X ■ sign{S))^]I{\S\ ^ a-^X)) 
= E{a'^e^I{\S\ ^a-^X)) 

+ 2E{a{aS - X ■ sign(S'))e/(|S'| ^ a'^X)) 

+ E{{aS - X ■ sign{S)fl{\S\ ^ a-^X)) 
= a\l + cr2)[l _ g(f -S)+ g{-i - ~5)] 



+ 2a{a5 - X)\/l + a^ip^x ~ 5) ~ 2a{aS + X)\/l + a^ip{-x - S) 
+ {aS - A)2|.(x - (5) + (ad + X)^^{-x - S). 

Hence, the MSE is 

E(/i2) = [l]-2x[2] + [3] 

~ a{a + 2b~ 2)(1 + a^)[l - g{S: - (5) + g{-S: - S)] 



[2aX + 2{a + b- l){aS - X)]\/l + a^^{S: - S) 



- [2aX -2{a + b- l){a5 + A)]Vl + crV(-i " 5) 
~ {a6 - X){ab + A)[l - ^{x - 5) + $(-£ - l)\. U 

8.4 Supporting lemmas for inadmissibility 

In this section we first recall Stein's Lemma. Then we prove two technical 
lemmas used in the proof of Theorem [5J 

Lemma 3. Let Z ^ A/'(0, 1) and fei 5 : M — > M be an indefinite integral of the 
Lebesgue measurable function g' , essentially the derivative of g. IfE{\g'{Z)\) < 
00 then 

E{g'{Z))^E{Zg{Z)). 



Proof Stein (1981). D 



Lemma 4. Let 77 ~ A/'(0, Id), b G W'-, and let A > and B > be constants. 
Let 



The 



\\b-v¥ + B 



V \\b-v\\^ + B J \{\\b ~ r,\\^ + B)^ 



< d + E 



A{A + 4- 2d) 
Wb-rjW^ + B 
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Proof. First, 

Now define 

h -Vk h - Vk 



gink) 



Ib-r^W^ + B {bk - m? + \\b-k - ^-kV +B' 
By Stein's lemma (Lemma p|, we have 



and thus 

Edl^lp) = d + E ' ^ ^" '" 



V (||6-?7||2 + B)2 ||6-r/||2 + B 

/(4A + ^2)||5_^||2 2Ad(||6-77||2+S^ 



V (||6-r?||2 + S)2 (||6-77||2 + S)2 

/ (4^ + yl2 _ 2Ad) (4A + A2)S 



= d + E 

= d + ^y ||6-7?||2 + S (||6_^||2 + s)2^ 

after collecting terms. D 

Lemma 5. -For integer m ^ 1, Zei Q ^ xJm)' C > ^, D > and put 

Q{C-m-^Q) 
Q + D 
Then 

^ ' m + 2 + D 
and so E{Z) > whenever C > 1 + 2/m. 

Proof. The xfm) density function is p„(x) = (2™/2-ir(f ))-ix"/2-ie-^/2. 
Thus 

E(Z) = ^- r ^('^-'^'^^) ^m/2-l -x/2j 

1 Z""" C'-m- a::^(„_^2)/2-ig-x/2 j^ 



Pm+2(a;)dx 



2'"/2+ir(™±2) p (7 _ rn-ix 



2W2r(^) 7o x + D 
m / , „ p,n+2{x)dx 



> m 



JO 2; + ^ 

C- (m + 2)/TO 



TO + 2 + D 
by Jensen's inequality. D 
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8.5 Proof of Theorem [5l 

We prove this first for Wpiug,/i ~ taping, that is, taking h{a\) = dcr^jn. We also 
assume at first that S^ — S. 

Note that /3s = /3 + (XlXsY^Xlss and /3b = /3 + 7 + {XIXbY^XIeb. 
It is convenient to define 



r\s 



Y}I-'{xIXs)-^xIes and r]B ^^^^""{XIXbY^XIeb. 



Then we can rewrite /35 = /3 + E ^l'^r\s and /3b = /3 + 7 + S ^^^r]B- Similarly, 
we let 



.2 \\Ys-Xsf3sr . .2 l|yB-^i?/3Bir 

(To = and ctr ~ . 

^ n-d ^ N-d 

Now {ris,rjB,^s,d''^) are mutually independent, with 

2 2 

^9 'i'2 1 •--9 R9 

We easily find that E(||X/35 — X^||^) = da^/n. Next we find a) and a bound 

onE(||X/3(a;)-X/3||2). 

Let 7* = Si/2^ so that 7 = /3b - /3s = S^^/^^^* + ^^^ _ r^^). Then 

7TE7 + da|/iV 



II7* + ??i3 - ??5||2 + d{al/N + alln) ' 
Now we can express the mean squared error as 

E(||X;9(c2;) - Xf3f) = E{\\XJ:-'^\u;7js + (1 - ^)(7* + VB))f) 
= Ei\\u;rjs + {l-uj){Y+VBW) 
= Ei\\r]s + il-L^)il*+VB-VsW) 

(7* + VB - ■qs)d^'s/n 



= E 



'^''^ Wr+riB-risV+diallN + al/n) 



To simplify the expression for mean squared error we introduce 

/O -2/2 2 

V*s = Vnvs/crs ^ A/'(0, /d), 
& = \/ri(7* +VB)/crs, 
A = d&g/ag = dQ/m, and 
B = nd{a%/N + a^l/n)/al 
^d{{n/N)al/al + Q/m). 
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The quantities A and B are, after conditioning, the constants that appear in 
technical Lemmajl] Similarly C and D introduced below match the constants 
used in Lemma [S] 

With these substitutions and some algebra, 



E{\\X/3{lu)-XI3\\^) = ^E 
n 



"Is 



-^E E 



Vs 



Mb~V*s) 


2\ 


1 Mb-Vs 


/ 

) 


2 


\\b-f]:^p+B 



Vb, ^1,0-1 



We now apply two technical lemmas from Section [8. 4[ 

Since 77^ is independent of {b,A,B) and Q ^ xfm)' ^^ Lemma El we have 



E 



Vs 



A{b-Vs) 



v*s¥ 



B 



991 , / A(A + 4: 

r]B,(Js,crj^\ <d + E' 



2d) 



\b- 



Vs\ 



B 



VB,^l,^% 



Hence 



A = Ei\\Xl3s - XI3\\^) - Ei\\Xl3{u;) - X(5\\^) 
_al^fAi2d-A-A) 



■E 



n 



-E 



dal 



'-E 



— ^E 



\\\h-^%r + B 

{dQ/m){2d-dQ/m-'i) 
\\b-r]*g\\^ + {B-A)+dQ/m) 
Q{2 - Q/m - 4/d) 



\\b~ 77* IPm/d + (S - A)m/d + Q 
/ QjC-Q/m) 



V Q + D 



where C = 2 - 4/d and £> = {m/d){\\b - 77* ||2 + dnN-'^a%/al). 

Now suppose that d ^ 5. Then C ^ 2 — 4/5 > 1 and so conditionally on 
''7s, Vb, and &g, the requirements of Lemma p^ are satisfied by C, D and Q. 
Therefore 



A> 



dal^ f m{\-A/d)-2 
■m + 2 + D 



(27) 



where the randomness in (27) is only through D which depends on r/g, rjs 
(through 6) and a^. By Jensen's inequality 



A> 



dal ™(1 - 4/rf) - 2 
~n' m + 2 + E{D) 



^0 



(28) 



whenever m(l — 4/(i) ^ 2. The first inequality in (28) is strict because var(£') > 
0. Therefore A > 0. The condition on m and d holds for any to ^ 10 when 
d^ 5. 
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For the general plug-in ojpiug,h we replace dag/N above by h{ag). This 
quantity depends on a^ and is independent of (t|, tjb and r]s. It appears within 
B where we need it to be non-negative in order to apply Lemma |4] It also 
appears within D which becomes {m/d){\\b — Ty^p -I- nh{a'^)/ag). Even when 



we take var{h{ag)) = we still get var(I?) > and so the first inequality in (28 1 
is still strict. 

Now suppose that S^ is not equal to S. The distributions of rjs, ^g and a'^ 
remain unchanged but now 



7yB-^(0,^Si/2E^iSi/2 



independently of the others. The changed distribution of rjB does not affect 
the application of Lemma [4] because that lemma is invoked conditionally on r]B- 
Similarly, Lemma [5] is applied conditionally on tjb- The changed distribution of 



r]B changes the distribution of D but we can still apply ( 28 ) . D 
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